Genomic Profiling of Insecticide Resistance in Malaria Vectors: Insights into Molecular Mechanisms

Malaria control faces challenges from widespread insecticide resistance in major Anopheles species. This study, employing a cross-species approach, integrates RNA-Sequencing, whole-genome sequencing, and microarray data to elucidate drivers of insecticide resistance in Anopheles gambiae complex and An. funestus. Findings show an inverse relationship between genetic diversity and gene expression, with highly expressed genes experiencing stronger purifying selection. These genes cluster physically in the genome, revealing potential coordinated regulation. We identified known and novel candidate insecticide resistance genes, enriched in metabolic, cuticular, and behavioural functions. We also present AnoExpress, a Python package, and an online interface for user-friendly exploration of resistance candidate expression. Despite millions of years of speciation, convergent gene expression responses to insecticidal selection pressures are observed across Anopheles species, providing crucial insights for malaria vector control. This study culminates in a rich dataset that allows us to understand molecular mechanisms, better enabling us to combat insecticide resistance effectively.


Introduction
Malaria remains a leading cause of worldwide morbidity and mortality, resulting in an estimated 619,000 deaths in 2021 [1].Insecticide-based vector control interventions are the most successful tools for reducing malaria incidence, responsible for over 80% of the malaria cases averted between 2000 and 2015 [2]; however, the gains made in malaria control have plateaued in the last eight years and cases are beginning to rise [1].The reversal of gains is partially due to widespread insecticide resistance within populations of the anopheline mosquito vector [3], reducing the e cacy of insecticide-treated bed neds (ITNs) and indoor residual spraying (IRS), the two primary malaria intervention tools [1,4].
Across sub-Saharan Africa, four mosquito species are responsible for the majority of malaria transmission, the An.gambiae species complex (An.gambiae, An. arabiensis, and An.coluzzii) and An.funestus [5][6][7].Understanding the similarities and differences in resistance mechanisms between the different dominant vector species is critical to inform malaria mitigation strategies, which are mostly employed indiscriminately despite differences in vector composition between and within areas of sub-Saharan Africa.
Sharing of resistance mechanisms between insects is commonly observed; mutations at the insecticide target site genes Rdl, Kdr and Ace-1 occur at equivalent codons throughout arthropods [8,9], and insecticide detoxi cation is often driven by cytochrome P450s from the CYP6 insect clade [10]; this may occur through parallel evolution, or in closely related insects, introgression, where genetic material is passed between species via hybridisation.Although adaptive introgression is seen at resistance loci within the An.gambiae species complex [11,12], there is no evidence of hybridisation or introgression with An. funestus, which diverged approximately 80 million years ago [13,14].
Insecticide resistance in Anopheles mosquitoes commonly involves large gene families, such as cytochrome P450 monooxygenases (P450s), glutathione S-transferases (GSTs), carboxylesterases (COEs), and UDP-glycosyltransferases (UDPs) [15,16].Upregulation of key P450s has been reported in insecticide resistant populations of each Anopheles species, for example CYP6Z1, CYP9K1, CYP6AA1 and CYP6P9a in An. funestus [17][18][19]; CYP9K1, CYP6AA1, CYP6P3 and CYP6M2 in An. gambiae and An.coluzzii [20][21][22]; and CYP6P4 in An. arabiensis [23].Similarly, cuticular resistance [24], which involves thickening and modi cation of the composition of the cuticle, slowing the rate of absorption of insecticides has been reported in both An.gambiae s.l [25] and An.funestus [26].Work comparing distinct Anopheles species found that gene numbers within these families were similar across species but lineage-speci c losses and gains are regularly observed [14].Despite these similarities, differences between the An.gambiae species complex and An.funestus are also observed.A prominent example is single nucleotide polymorphisms (SNPs) in the target site of insecticide, which reduce the e cacy of insecticides through changes to binding a nity.The two best studied are Kdr and Ace-1 [27,28], both of which are found in the An.gambiae species complex, but which are seemingly absent in An. funestus.Novel mechanisms of resistance have also recently been discovered, including the direct binding of pyrethroids by chemosensory proteins (CSPs) [29], as well as evidence for a role of both hexamerins, αcrystallins [30].The recent discovery of such mechanisms highlights the complexity of resistance and the need for further study into molecular mechanisms of resistance in major malaria vectors.
A plethora of 'omics data has now been generated for malaria vectors.Over the past two decades, transcriptomic studies have driven discoveries into mechanisms of insecticide resistance, rst in the form of microarrays, and subsequently RNA-Sequencing.In addition, since publishing the rst phase of the Anopheles 1000 genomes project in 2017 [31], the MalariaGEN Vector Observatory has generated and made public thousands of high-quality whole-genome sequences of major malaria vectors from throughout sub-Saharan Africa.Although individual -omics experiments have had great success in identifying highly over-expressed transcripts involved in resistance, these studies result in thousands of differentially expressed genes resulting from noise from the susceptible comparator.Previous work has successfully used microarray data in exploratory analyses to identify patterns of gene expression across the An.gambiae species complex, identifying meta-signatures across geographically and temporally disparate data, highlighting the importance of meta-analyses [29,30].In this study we characterise geneexpression pro les from published transcriptomic studies across two malaria vector complexes and establish relationships with whole-genome sequence data.To achieve this, we explored expression in resistance candidate genes and protein-coding families across 28 individual experiments of An. gambiae s.l and An.funestus [32][33][34][35][36][37][38].These data have been integrated both with the previous microarray metaanalysis [30] and with whole-genome sequence data from the Anopheles 1000 genome project [31].The meta-analysis performed here has been made available as a resource to the community as a user-friendly python package, AnoExpress, combined with convenient interactive notebooks intended to be run in Google Colaboratory.

AnoExpress
We performed a differential expression meta-analysis on read count data from 28 RNA-sequencing datasets representing insecticide-resistant Anopheles populations from 11 countries in sub-Saharan Africa [32][33][34][35][36][37][38] (Fig. 1, Supplementary Table 1).Of the 28 resistant populations, 15 are An.coluzzii, 3 are An.gambiae, 5 are An.arabiensis and 5 are An.funestus.Read-count data from different species were derived from aligning to each respective reference genome assemblies.For differential expression analysis, each resistant population was compared to an insecticide-susceptible strain of the same species from the same study.To enable cross-species comparison of putative resistance genes, we located orthologs between each different genome assembly and the AgamP4 PEST reference genome, using the OrthoMCL algorithm in VectorBase.As the inclusion of each successive assembly reduces the numbers of genes that are present, we provide four differential expression analyses; the full dataset, termed 'gamb_colu_arab_fun' (8599 genes), and secondary analyses -'gamb_colu_arab' (8651 genes), 'gamb_colu' (11288 genes), 'fun' (14176 genes).Analyses herein were performed on the full gamb_colu_arab_fun dataset unless speci ed.These data are presented within a python package, AnoExpress, made for the community, with example notebooks provided to run in Google Colab.Users can load, explore, and visualise the gene expression meta-analysis data, including reproducing most analyses presented herein.AnoExpress can also directly load gene expression data from an earlier meta-analysis of microarray data [30].AnoExpress is located here: github.com/sanjaynagi/AnoExpress.Documentation and a video user-guide are provided to improve ease of access.

Dataset structure
To investigate overall structure in the dataset, we performed principal components analysis on both the count and fold-change data.PCA on the log-transformed count data revealed ve distinct clusters present.Four clusters contain samples only from An. funestus, An. coluzzii, or An.arabiensis, with a fth cluster containing a mixture of An. gambiae, An. arabiensis and An.coluzzii (Fig. 1A).The four species did not resolve into four distinct clusters, suggesting that batch effects from different studies could be present.We therefore only performed within-study differential expression analysis.To compare RNA-Sequencing data with the earlier microarray studies, we performed a PCA on the fold-change data from all available experiments (Supplementary Fig. 1).There was no evidence of clustering between the two technologies, suggesting that the results between the two methodologies are comparable and can be used in combined analyses.
Correlations between pairs of orthologous genes count data from different species were high, with lower correlation between An. funestus and the other three species, which is to be expected given their considerable divergence time (Fig. 1B).

Gene expression is physically clustered in the genome
As changes in gene expression are likely to be related to cis-acting factors and the local genetic context, genes in close physical proximity are more likely to show similar patterns of expression.To test this within our dataset, we calculated and averaged the Pearson's correlation on log2 fold-changes between each neighbouring gene in the dataset, resulting in a mean Pearson's R of 0.10 for neighbouring genes.To determine whether this value was higher than expected based on chance, we randomly permuted the positions of all genes in the genome 1000 times and re-calculated the mean Pearson's R for now neighbouring genes.Figure 2C shows the histogram of mean Pearson's correlations, demonstrating a clear effect of genome proximity on gene expression.
To further explore physical clustering of resistance-related gene expression, we visualised fold-change values across the whole genome.In Fig. 3, we use AnoExpress to plot gene expression for all genes across all comparisons in the gamb_colu_arab_fun analysis and calculate the moving average log2 fold change across the genome in sliding windows.We identify gene clusters known to be associated with insecticide resistance, such as the CYP6P, CYP9K, CYP6M/Z and GSTD gene clusters.This analysis also highlighted numerous clusters of cuticle proteins which show elevated expression.

Genetic diversity and purifying selection scale with levels of gene expression
We postulated that genes showing higher levels of absolute expression are more likely to be functional in multiple tissues or pathways, and so should be highly conserved, experiencing the highest levels of purifying selection.We integrated whole-genome sequence data from the Anopheles 1000 genomes project [31], to explore the relationship between gene expression and genetic diversity, for each species in which genomic data was publicly available.To compare between species, we selected representative populations of An. gambiae (Nagongera, Uganda), An. coluzzii (Hauts-Bassins, Burkina Faso), and An.arabiensis (Muleba, Tanzania), collected between 2012 and 2015.These cohorts show no signatures of inbreeding of major demographic events.As a measure of absolute expression, we used normalised count data.We then counted the number of segregating sites of synonymous and non-synonymous mutations, as well as calculating nucleotide diversity, π, in each gene, for each species.
Figure 4 shows the ratio of pN/pS and nucleotide diversity, for various levels of expression binned into 5% percentiles for the gamb_colu_arab analysis.We show that average nucleotide diversity is reduced for the most highly expressed genes and is lower overall in An. arabiensis compared to An. coluzzii and An.gambiae, which ts with expectations on genetic diversity from the literature [39].
We show that for each species, the pN/pS ratio is lower for the most highly expressed genes compared to lowly expressed genes.This holds more strongly in An. gambiae and An.coluzzii than An.arabiensis, which likely relates to the e ciency of purifying selection depending on population size -the effects of purifying selection will be stronger in larger populations.
Highly overexpressed genes are associated with selective sweeps in wild-caught mosquitoes Given that insecticide resistance is often associated with increased expression of metabolic genes and that bene cial resistance mutations can cause selective sweeps which spread through a population, we hypothesised that highly expressed genes are more likely to be found in proximity to genomic regions under selection in wild-caught mosquitoes.
To determine this, we investigated genome-wide signals of recent selection using the H12 statistic [40] on data from phase 3 of the Anopheles 1000 genomes project (Nagi et al., in prep).Every signal was then mined to determine which genes in the AgamP4 assembly, if any, lie at the location of the peak of a selection signal.Further information on genome-wide selection scans and signal calling is found in Supplementary Text 1. Candidate genes were then de ned as those genes showing a median fold-change of greater than two in the An.gambiae, An. coluzzii and An.arabiensis dataset (gamb_colu_arab), as these three species are represented in the Ag1000G phase 3 data [38].We additionally ltered out lowly expressed genes (median counts < 5).After ltering, 106 candidate genes remained, of which 34 were located at the site of a signal of selection (Supplementary Table 2).Figure 5 displays the location of these genes in the An.gambiae PEST reference genome, and recovers known insecticide resistance candidates such as CYP6P3, CYP6M2, GSTE2 and CYP9K1, as well as numerous other genes putatively involved in resistance.
To determine whether this was a signi cant enrichment of genes compared to random chance, we carried out a permutation analysis.In 10000 permutations, only 2 equalled or exceeded the value of 34 (p-value = 0.0002), suggesting that regions of the genome under selection are indeed enriched for the most highly overexpressed genes.Separately, we utilised the entire dataset of genes and found a positive association between a genes' median fold-change and whether it was found at the site of an H12 selective sweep signal (p GLM = 0.000).

Signatures of resistance-associated gene expression
To explore highly over-expressed genes across all four species, we ranked genes by mean and median log 2 fold change across all experiments (Supplementary Tables 3 and 4).Included in the over-expressed set (cut-off of fold change > 2, median counts > 5), are 65 and 78 genes respectively.Within these lists, the known pyrethroid metaboliser CYP6P3 [39] (the ortholog of CYP6P9a/b in An. funestus) had a median and mean fold change of > 6.8, demonstrating the utility of the meta-analytic approach.Included in both lists were other cytochrome P450s including CYP6P4, CYP9K1, CYP6M2, CYP6Z2, CYP6Z3 and CYP6P5, all of which have previously been linked to pyrethroid resistance across both species [20,21].Interestingly, CYP4H17 which has never been explored for a role in pyrethroid resistance, appears 7th (mean) and 11th (median) with a median fold change of 3.47.Similarly appearing high across both mean and median rankings are cuticular related proteins CPLCG4, CPR150 and CPLCA3, a venom allergen (AGAP006417), an alkaline phosphatase (AGAP001684), gustatory receptor 49 (AGAP001169) and a protein with no known function (AGAP009327).In total, 50 genes are present in both the median and mean over-expressed sets including one carboxylesterases, 11 cytochrome P450s, three glutathione-S-transferases and one UGT from the major detoxi cation enzyme families (Fig. 6A).Additionally, 9 cuticular proteins, one D7 salivary gland protein and two hexamerins are present from gene families previously linked to resistance [30,41].Two genes related to odorant or taste detection are also present.
We then explored highly down-regulated genes based on median fold-changes (Fig. 6C).Amongst the most down-regulated genes was CYP307A1, a known regulator of ecdysone synthesis [42].In addition to hormone related transcripts CLIPA13 and Galectin 6 are heavily down-regulated.Zero GO terms were enriched based on the 5% lowest median fold-changes.If we instead use the 5% lowest mean foldchanges, we observe signi cant enrichment for sodium channel activity (p = 0.00013), sodium ion transport (p = 0.00059) and sodium ion transmembrane transport (p = 0.0044) the target sites for pyrethroids.

Consistently differentially expressed genes
We then explored genes which were signi cantly up and down-regulated most consistently.
In contrast, only 8 genes showed consistent down-regulation in at least 19 out of 28 experiments (Supplementary Table 6), including two opsin genes, GPROP4 and GPROP6, and a molecular chaperone HtpG gene (AGAP006961).The genes are enriched in seven GO terms relating to light detection, such as photoreceptor activity (p = 0.029), visual perception (p = 0.029) and phototransduction (p = 0.039).

Detoxi cation genes
In total 18 cytochrome P450s show either a median and mean fold change of > 2 and 9 are signi cantly upregulated in at least 19 out of 28 of all datasets.The cytochrome P450 family most represented is the CYP6 family, orthologous to the CYP3 clade in humans, responsible for the vast majority of xenobiotic metabolism.The CYP6Ps, CYP6M2 and CYP9K1 which have been repeatedly associated with insecticide resistance are all heavily implicated and are located in regions of the genome under selection (Fig. 5).Additionally, four members of the CYP4 family appear to be overexpressed across multiple resistant populations, particularly CYP4H17, but also CYP4H16, CYP4H18, and CYP4C28.
Similarly, 4 out of 27 GSTs (Fig. 6) have high mean or median fold-changes with one, GSTD3 showing consistently signi cant overexpression.Of these, GSTE2 has been shown to be involved in pyrethroid resistance [38], whilst GSTE1, 2 and 5 have been implicated in DDT resistance [43].Of the other phase two and three detoxi cation families, two UGTs (AGAP006222 and AGAP011564) and two carboxylesterases (COEAE6O and COEAE8O) are expressed with high mean or median expression or consistently over-expressed (Fig. 6).

Cuticular Proteins
113 cuticular proteins were included in the full gamb_colu_arab_fun analysis; after ltering for lowlyexpressed genes, only 61 of these remained.20 of these genes showed a mean or median fold-change of above 2 in resistant populations, and although none showed signi cant over-expression in 19/28 experiments, 9 showed upregulation in at least 16, including CPLCX3, CPR130, CPAP3-C, CPR16, CPR140, CPR59, CPR79, CPLCP12 and CPR81.

Expression of neuronal-related genes
Gene-set enrichment analysis of GO terms on the 5% most down-regulated genes for median expression revealed signi cant enrichments in synapse (15 genes), sodium channel activity (10 genes) and chemical synaptic transmission (13 genes).The primary target site for pyrethroid insecticides, the Vgsc, is not present in the full gamb_colu_arab_fun analysis, but investigation of the secondary analyses shows sporadic down-regulation in An. coluzzii and An.funestus but not An.gambiae, suggesting that changes in expression of this gene are unlikely to be playing a role in resistance.Interestingly, Ace-1 expression demonstrated up-regulation in almost all An.coluzzii and An.gambiae populations whilst it is downregulated in all 5 of the An.funestus populations.The gene coding for the beta-2 subunit of the nicotinic acetylcholine receptor is signi cantly overexpressed in 11 out of 28 experiments.

Other a priori resistance candidates
A number of genes and gene families have recently been implicated in insecticide resistance, including CSPs, hexamerins, alpha-crystallin, the transcription factor Maf-S and d7 salivary proteins [29,30,41,44].Of the 8 chemosensory proteins, just three remained in the full gamb_colu_arab_fun analysis, CSP3, CSP4 and SAP1, none of which showed high mean or median fold-changes across four species.Similarly, three hexamerins were included in these data, all three of which showed high mean fold changes ranging from 2.64 to 3.4.Further exploration revealed that overexpression was observed in all An.arabiensis and the majority of An. gambiae experiments, with the orthologs of AGAP001345 and AGAP001657 being overexpressed in two (Uganda and FuMoz) out of ve An. funestus datasets.Just two a-crystallins were present in the data, neither of which showed an expression-association with resistance.The transcription factor Maf-S was previously identi ed as up-regulated across multiple An.gambiae s.l populations in the meta-analysis of microarray data [30].It also shows similarly consistent overexpression in the RNA-Sequencing data, with 12 out of 28 populations exhibiting signi cant up-regulation.

Transcriptional regulation of insecticide resistance
Previous work using transcriptomic time-course data revealed putative transcription factors regulating the expression of transcripts post-pyrethroid exposure [45].Here, we used the GENIE3 algorithm in the Grenadine package to re-capitulate a gene regulatory network utilising the 28 RNA-sequencing and including 31 microarray experiments.After ltering score for the top 5th percentile (Supplementary Table 7), 50,967 interactions were predicted.Firstly, we investigated Maf-S due to its prior link with insecticide resistance [44].269 putative interactions were identi ed, including the pyrethroid-resistance related transcripts CYP6M2, CYP6Z3, CYP9J4, two ABC transporters and three GSTs including GSTMS1.To validate the utility of this approach, microarray data published from a Maf-S knockdown [44] was compared with the model predictions; 83 out of the predicted 269 genes were present on the list.We then carried out a permutation analysis, and out of 1000 permutations only 18 had 83 or more overlaps (p = 0.018) giving con dence to the model predictions.

Discussion
Despite the wide availability of transcriptomic data for major malaria vectors, no study to date has analysed these data across the four primary vectors, An. gambiae, An. coluzzii, An. arabiensis and An.funestus.In this study we performed a meta-analysis of these data, demonstrating clear convergence of molecular mechanisms of known insecticide resistance genes, as well as discovering novel candidates showing high levels of expression across the divergent species.We present a user-friendly python package, AnoExpress, which allows users to explore, visualise, and analyse the transcriptomic metaanalysis data.
Integrating differential expression and single nucleotide polymorphism (SNP) data has the potential to enhance the discovery of novel resistance markers [46,47].We integrated whole-genome sequence data from the Ag1000G with our transcriptomic meta-analysis, estimating signals of positive selection in wild mosquitoes, and locating the intersection of genes that both are highly overexpressed and are located at sites of selection signals.As over-expression of genes is a common mechanism which can confer insecticide resistance, cis-acting mutations which are likely to increase the expression of nearby genes are expected to be under positive selection.Indeed, cis-acting factors have been identi ed on a triplemutant haplotype around CYP6AA1 in East African An. gambiae [22], and in An. funestus driving expression of CYP6P9a/b [48].
The transcriptomic meta-analysis recovered many known resistance genes from metabolic gene families, including CYP6P3 (the CYP6P9a/b ortholog), CYP6P4, CYP6AA1, CYP9K1, CYP6Z2, CYP6Z3, CYP6M2 and Gste2 [20][21][22].Each of these cytochrome P450s have been shown to directly metabolise insecticides [20,49], as well as being associated with selective sweeps in eld populations [19,31].Additionally, several detoxi cation genes showing consistent overexpression have recently been implicated in gene duplication events in the An.gambiae complex, including CYP9K1, CYP6AA1, COEAE60 and GSTE2-4 [22,50].The appearance of these genes in both candidates with consistently high mean/median fold change, as well as those consistently expressed over multiple datasets demonstrates the power of this approach.CYP4H17 is a striking example of a gene which seems to have been missed in earlier studies and is a promising candidate for functional validation in both An.gambiae s.l and An.funestus.
In addition to detoxi cation genes, we see a signature of oxidative stress response.For example, Maf-S is involved in the Nrf2-cnc pathway, which induces expression of stress response genes upon changes to the oxidative stress levels of the cell [44].Interestingly, Maf-S has been shown to have a role in pyrethroid resistance and was identi ed from consistent overexpression in microarray datasets [44] and is similarly consistently overexpressed in the RNA-Sequencing data, whilst another member of the pathway, Keap1, is upregulated in An. funestus from Ghana.In addition to this characterised pathway, protein DJ-1 (AGAP000705) and Catalase are consistently overexpressed, both these genes play a role in protection against oxidative stress [51].These data are consistent with previous publications indicating that insecticide exposure induces an oxidative stress response [52][53][54].
Signatures of expression in candidates related to cuticular hydrocarbons were also present, which may indicate a change to lipid processing as recently proposed [55,56] or be related to a thicker cuticle [24].Genome-wide expression scans highlighted multiple gene clusters of cuticular protein families as outliers of gene expression, including the CLPCA, CPLCG, CPLCP and CPR families, which is re ected in the large number of these genes in the top candidate gene lists based on median and mean fold-changes.
AGAP010368, an ortholog to a gene involved in fatty acid alpha-oxidation in Drosophila, a long chain fatty acid CoA ligase (AGAP009159) and a fatty acid elongase (AGAP003195) are consistently over expressed.The large number of cuticular proteins (20/113) highly expressed and the lack of those consistent across all populations may suggest high levels of redundancy.
We observe little differential expression of the major target site genes, such as the target of pyrethroids the Vgsc and dieldrin Rdl.As previously reported, the expression pro le of the OP and carbamate target Ace-1 is an exception, often upregulated in An. coluzzii and An.gambiae yet showing negative foldchanges An. funestus experiments.It is worth noting that all ve An. funestus experiments included in this study use the same susceptible comparator strain, FANG, and so this result could be a peculiarity of this laboratory strain.We nd regular over-expression of the gene coding for the beta-2 subunit of nicotinic acetylcholine receptor (AGAP010057), recently associated with Pirimiphos-methyl resistance in a genome-wide association study [12].
As well as physiological changes which directly confer resistance to insecticides, the potential for mosquitoes to avoid contact with insecticides and exploit humans when they are least protected is a serious threat to malaria vector control.Mosquito behaviour is likely to be driven in part by gene expression [57] and transcriptomic studies may allow us to identify candidate genes involved in insecticide-resistant behaviours.This is of particular interest given the di culty in measuring behavioural resistance in wild mosquito populations.Interestingly, enrichment analyses highlighted that our expression candidate genes were enriched for genes involved in olfactory processes.Gustatory receptor 49 (AGAP001169) shows a median fold-change of 3.39, the highest of any olfactory receptor in our data, and is a promising candidate for behaviourally-linked insecticide resistance.The odorant receptors 13 (AGAP009396) and 9 (AGAP008333) also show high median fold-changes of 2.73 and 2.71, respectively.It is, however, important to note that adaptation of susceptible strains to insectary rearing could cause the downregulation of olfactory genes, and so contribute to these signals.
We also identify candidate genes from gene families not previously associated with resistance in malaria vectors.The ortholog of neural lazarillo (AGAP009281) has a high median fold-change, a gene which is known to contribute to longevity, stress resistance and behavioural change through the JNK stress response pathway in Drosophila [58].Several venom allergens also exhibit high median fold-changes.Interestingly, Venom allergen 5 has been associated with resistance in Culex through dsRNA and overexpression in cells [59].These genes are poorly understood but a number have been found in the salivary gland proteome where they have a diverse range of functions, with none currently being characterised [60].
Trans-acting regulation plays an important role in modulating gene expression to insecticide stress in insects [44,[61][62][63].To explore whether we could identify transcription factors involved in insecticide resistance, we inferred a gene regulatory network.The predicted interactors of several transcription factors were enriched in GO terms related to insecticide resistance.Amongst them is Rootletin, which is known to play a role in behavioural response in Drosophila [64] and was previously linked to resistance as a key hub in response to insecticide exposure [45], whilst Grau has been linked to pyrethroid resistance in Aedes populations [65].Other transcription factors enriched in insecticide resistance-related terms that are consistent with known roles in Drosophila includes Cabut, which has been shown to regulate stress response [66], E(spl)m3-HLH which is linked with stress granules [67], Grh that is involved in cuticular repair [68], Exex is involved in neuronal function [69], Onecut with synapse organisation [70] and Tfam is a mitochondrial transcription factor, with a role in oxidative stress response [71].
We also explored the role that gene expression plays in shaping levels of genetic diversity across the genome.Protein evolution is constrained by purifying selection, which acts to prevent deleterious mutations from spreading in a population [72].Previous studies have shown a relationship between higher gene expression and a slower accumulation of deleterious mutations [46].We nd a similar relationship in Anopheles mosquitoes, with the most highly expressed genes displaying lower rates of pN/pS, as well as a reduction in measures of genetic diversity, and replicate this effect across three mosquito species.Anopheles gambiae s.l populations exhibit extremely large effective population sizes, which should accelerate the removal of deleterious mutations.Between species differences support this, as An.arabiensis displays lower nucleotide diversity, but higher rates of pN/pS for the most highly expressed genes.
In this study, we present a holistic overview of signatures of gene expression between from the major malaria vectors Anopheles gambiae s.l and Anopheles funestus.As well as establishing relationships between gene expression and whole-genome sequence data, we demonstrate the importance and convergence of detoxi cation genes, as well as highlighting putative transcription factors and novel gene families involved in insecticide resistance.Despite the power in this study there are several limitations that should be considered; rstly the An.funestus data is from one experiment, and therefore is likely to have signi cant batch effects.Secondly, Anopheles have a gene/loss factor ve times that of Drosophila [14] resulting in many gene families have one-to-many orthologs which have been excluded from this study, which results in a loss of data.Finally, here we take 'insecticide resistance' as an overall phenotype because the populations grouped in this study have differential resistances to different classes, although all are pyrethroid resistance.After taking these limitations into count, these data highlight novel functional validation and genomic surveillance targets for malaria vector control in Africa.

RNA-sequencing data
All RNA-sequencing data published comparing resistant and susceptible members of An. gambiae, An. coluzzii, An. arabiensis and An.funestus were retrieved from Google Scholar in April 2022.Of the 28 resistant populations studied, 15 are An.coluzzii, 3 are An.gambiae, 5 are An.arabiensis and 5 are An.funestus (Supplementary Table 1) [32][33][34][35][36][37][38].As multiple studies used the same susceptible populations due to the widespread resistance found in endemic settings, replicates of the same populations across studies were present (Supplementary Table 1).Count les were retrieved from authors of each paper and combined to form one large counts le.VectorBase was used to retrieve orthologs from each species to An. gambiae PEST and the counts of one-to-many relationships were averaged.

Differential expression analysis
The data was then normalised using the DEseq2 1.26.0 [73] using estimateSizeFactors, estimateDispersions and the varianceStabilizingTransformation.Differential expression analysis (DEA) was performed with DESeq2 v1.26 in R v3.6.3.Due to batch effects and large differences in library depth between the included RNA-Sequencing studies, we only performed DEA within each experiment, comparing the susceptible (control) to the resistant strain (case).Positive fold changes indicate overexpression in the resistant strain.Hypothesis testing was performed with the DESeq2 wald test.
Microarray data integrated Microarray data from an earlier meta-analysis [30] into AnoExpress.As this was performed at the transcript level rather than genes, we averaged log2 fold change and adjusted p-value data across transcripts for a given gene to match the AnoExpress meta-analysis.

Gene-set enrichment analysis
Gene set-enrichment was performed using the hypergeometric test implementation in Scipy, incorporated into AnoExpress.GO terms were extracted from VectorBase and PFAM domains from [74].

Genomic clustering
Using the gamb_colu_arab analysis we calculated and averaged the Pearson's correlation between log2 fold changes for each neighbouring gene pair in the dataset, and then calculated the mean pearsons R across all pairs.We excluded An. funestus due to differences in synteny compared to An. gambiae s.l.To determine whether this value was higher than expected based on chance, we randomly permuted the positions of all genes in the genome 1000 times, calculated the Pearsons correlation between neighbouring genes, and calculated the mean.We considered the effect signi cant if the fraction of permutations showing a more extreme value than the actual true value was below 0.05.

Genetic diversity
Using log2 count data from the gamb_colu_arab analysis, we calculated the median counts across all experiments, and binned genes into 5% percentiles of expression level.We excluded An. funestus, as su cient whole-genome sequence data is not yet publicly available.We selected three cohorts to analyse from the Anopheles 1000 genome project [31]; An. gambiae from Nagongera, Uganda, An. coluzzii from Hauts-Bassins, Burkina Faso, and An.arabiensis from Muleba, Tanzania, collected between 2012 and 2015.We randomly selected 50 individuals from each cohort.Using segregating sites only, we calculated the number of non-synonymous and synoynmous sites for each transcript, using the rst transcript annotation for each gene, up to -RC.We calculated the overall CDS length per transcript and calculated the number of synonymous or non-synoynmous mutations per 1000bp of CDS.We also calculated nucleotide diversity and Wattersons Theta across the entire length of the gene (including introns), using scikit-allel v1.2.1 [75].

Genome-wide selection scan (GWSS) data
We integrate data from H12 genome-wide selection scans [40] from the selection-atlas.More information on the GWSS and peak-calling algorithm can be found in Supplementary Text 1.

Genome-wide expression scans
We calculate a sliding window mean of log2 fold-change data, for the gamb_colu_arab_fun analysis, with a window size of 10 genes and a step of 2 genes.Genome-wide expression scans.Expression for all genes in the gamb_colu_arab_fun analysis is plotted against the An.gambiae PEST reference genome.Different colours represent different species -blue = An.coluzzii, orange = An.gambiae, green = An.arabiensis and red = An.funestus.A moving average of gene expression is plotted as a black line, calculated in sliding windows of 10 genes, with a step of 2 genes.The Y-axis is truncated between -8 and +10 log2 Fold-change to ease interpretation.Signals in close proximity to known, or putative, IR loci are labelled and highlighted with dashed grey lines.Gene density is displayed below the x-axis.In AnoExpress, plots are interactive, aiding interpretation.Insecticide resistance-associated selection and expression in An. gambiae.The of genes in the AgamP4 PEST reference genome, which are both found in regions of selective sweeps in whole-genome sequence data, and which are also resistance candidates based on average fold-changes in the RNA-Sequencing meta-analysis data.

Figures
Figures

Figure 1 Overview
Figure 1

Figure 2 A
Figure 2

Figure 4 Gene
Figure 4